{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d521fcdd-39f6-4c5d-9466-8e6f7fb8ab2e",
   "metadata": {},
   "source": [
    "# No code configuration"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a533b0a0-7acc-4164-ad2c-cb64b3d25aca",
   "metadata": {},
   "source": [
    "No-code configuration can be helpful in three scenarios:\n",
    "\n",
    "1. There's an existing set of regular expressions / deny-lists that should be leveraged within Presidio.\n",
    "2. As a simple way to configure which recognizers to enable and disable, and how to configure the NLP engine.\n",
    "3. For team members interested in changing the configuration without writing code.\n",
    "\n",
    "In this example, we'll show how to create a no-code configuration in Presidio.\n",
    "We start by creating YAML configuration files that are based on the default ones. \n",
    "Te default configuration files for Presidio can be found here:\n",
    "- [Analyzer configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml)\n",
    "- [Recognizer registry configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml)\n",
    "- [NLP engine configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default.yaml)\n",
    "\n",
    "Alternatively, one can create one configuration file for all three components.\n",
    "In this example, we'll tweak the configuration to reduce the number of predefinedrecognizers to only a few, and add a new custom one. We'll also adjust the context words to support the detection of a different language (Spanish).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6cd78f11-d3b3-43f9-8cc5-6a08b52e2403",
   "metadata": {},
   "outputs": [],
   "source": [
    "import yaml\n",
    "import json\n",
    "import tempfile\n",
    "import warnings\n",
    "from pprint import pprint\n",
    "from presidio_analyzer import AnalyzerEngineProvider\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b864e1b-a2a3-4ed8-be43-092f02e56d55",
   "metadata": {},
   "source": [
    "In this example we're going to create the yaml as a string for illustration purposes, but the more common scenario is to create these YAML files and load them into the `PresidioAnalyzerProvider`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9894a09f-9df2-4afa-8c3f-3b2a0f72f270",
   "metadata": {},
   "source": [
    "### General Analyzer parameters\n",
    "([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2f1d8b01-f9d2-4827-a1d1-ba4d6ca70daf",
   "metadata": {},
   "outputs": [],
   "source": [
    "analyzer_config_yaml = \"\"\"\n",
    "supported_languages: \n",
    "  - en\n",
    "  - es\n",
    "default_score_threshold: 0.4\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f423e92-1eb4-4f85-9427-578aef7dc25f",
   "metadata": {},
   "source": [
    "### Recognizer Registry parameters\n",
    "([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4c3b6a93-79a5-467e-b08d-d59f7ea461e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "recognizer_registry_config_yaml = \"\"\"\n",
    "recognizer_registry:\n",
    "  supported_languages: \n",
    "  - en\n",
    "  - es\n",
    "  global_regex_flags: 26\n",
    "\n",
    "  recognizers:\n",
    "  - name: CreditCardRecognizer\n",
    "    supported_languages:\n",
    "    - language: en\n",
    "      context: [credit, card, visa, mastercard, cc, amex, discover, jcb, diners, maestro, instapayment]\n",
    "    - language: es\n",
    "      context: [tarjeta, credito, visa, mastercard, cc, amex, discover, jcb, diners, maestro, instapayment]\n",
    "    type: predefined\n",
    "    \n",
    "  - name: DateRecognizer\n",
    "    supported_languages:\n",
    "    - language: en\n",
    "      context: [date, time, birthday, birthdate, dob]\n",
    "    - language: es\n",
    "      context: [fecha, tiempo, hora, nacimiento, dob]\n",
    "    type: predefined\n",
    "\n",
    "  - name: EmailRecognizer\n",
    "    supported_languages:\n",
    "    - language: en\n",
    "      context: [email, mail, address]\n",
    "    - language: es\n",
    "      context: [correo, electrónico, email]\n",
    "    type: predefined\n",
    "    \n",
    "  - name: PhoneRecognizer\n",
    "    type: predefined\n",
    "    supported_languages:\n",
    "    - language: en\n",
    "      context: [phone, number, telephone, fax]\n",
    "    - language: es\n",
    "      context: [teléfono, número, fax]\n",
    "    \n",
    "  - name: \"Titles recognizer (en)\"\n",
    "    supported_language: \"en\"\n",
    "    supported_entity: \"TITLE\"\n",
    "    deny_list:\n",
    "      - Mr.\n",
    "      - Mrs.\n",
    "      - Ms.\n",
    "      - Miss\n",
    "      - Dr.\n",
    "      - Prof.\n",
    "      - Doctor\n",
    "      - Professor\n",
    "  - name: \"Titles recognizer (es)\"\n",
    "    supported_language: \"es\"\n",
    "    supported_entity: \"TITLE\"\n",
    "    deny_list:\n",
    "      - Sr.\n",
    "      - Señor\n",
    "      - Sra.\n",
    "      - Señora\n",
    "      - Srta.\n",
    "      - Señorita\n",
    "      - Dr.\n",
    "      - Doctor\n",
    "      - Doctora\n",
    "      - Prof.\n",
    "      - Profesor\n",
    "      - Profesora\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49d6b8d7-dec7-4d0f-932b-5795e1b665bf",
   "metadata": {},
   "source": [
    "### NLP Engine parameters\n",
    "([default file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default.yaml))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "daef9a20-e126-483f-9f1f-7d29f0b5f6f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp_engine_yaml = \"\"\"\n",
    "nlp_configuration:\n",
    "    nlp_engine_name: transformers\n",
    "    models:\n",
    "      -\n",
    "        lang_code: en\n",
    "        model_name:\n",
    "          spacy: en_core_web_sm\n",
    "          transformers: StanfordAIMI/stanford-deidentifier-base\n",
    "      -\n",
    "        lang_code: es\n",
    "        model_name:\n",
    "          spacy: es_core_news_sm\n",
    "          transformers: MMG/xlm-roberta-large-ner-spanish  \n",
    "    ner_model_configuration:\n",
    "      labels_to_ignore:\n",
    "      - O\n",
    "      aggregation_strategy: first # \"simple\", \"first\", \"average\", \"max\"\n",
    "      stride: 16\n",
    "      alignment_mode: expand # \"strict\", \"contract\", \"expand\"\n",
    "      model_to_presidio_entity_mapping:\n",
    "        PER: PERSON\n",
    "        PERSON: PERSON\n",
    "        LOC: LOCATION\n",
    "        LOCATION: LOCATION\n",
    "        GPE: LOCATION\n",
    "        ORG: ORGANIZATION\n",
    "        ORGANIZATION: ORGANIZATION\n",
    "        NORP: NRP\n",
    "        AGE: AGE\n",
    "        ID: ID\n",
    "        EMAIL: EMAIL\n",
    "        PATIENT: PERSON\n",
    "        STAFF: PERSON\n",
    "        HOSP: ORGANIZATION\n",
    "        PATORG: ORGANIZATION\n",
    "        DATE: DATE_TIME\n",
    "        TIME: DATE_TIME\n",
    "        PHONE: PHONE_NUMBER\n",
    "        HCW: PERSON\n",
    "        HOSPITAL: LOCATION\n",
    "        FACILITY: LOCATION\n",
    "        VENDOR: ORGANIZATION\n",
    "        MISC: ID\n",
    "    \n",
    "      low_confidence_score_multiplier: 0.4\n",
    "      low_score_entity_names:\n",
    "      - ID\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a827c70c-c39c-4399-97a4-0bd41040fb6c",
   "metadata": {},
   "source": [
    "Create a unified YAML file and save it as a temp file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9acd2c0d-793a-4a33-8c9e-622d01057f79",
   "metadata": {},
   "outputs": [],
   "source": [
    "full_config = f\"{analyzer_config_yaml}\\n{recognizer_registry_config_yaml}\\n{nlp_engine_yaml}\"\n",
    "\n",
    "with tempfile.NamedTemporaryFile(mode='w+', delete=False, suffix='.yaml') as temp_file:\n",
    "    # Write the YAML string to the temp file\n",
    "    temp_file.write(full_config)\n",
    "    temp_file_path = temp_file.name\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3957de3-88b8-4aa8-844a-e31f84c26711",
   "metadata": {},
   "source": [
    "Pass the YAML file to `AnalyzerEngineProvider` to create an `AnalyzerEngine` instance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b25df9cd-483f-473a-be0a-3c7b0e1b4e4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "analyzer_engine = AnalyzerEngineProvider(analyzer_engine_conf_file=temp_file_path).create_engine()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06fbe6f4-cefe-4875-9ddf-db8b50056392",
   "metadata": {},
   "source": [
    "Print the loaded configuration for both languages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee8cb346-5dc1-4b08-9ee4-6a71308c10d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "for lang in (\"en\", \"es\"):\n",
    "    pprint(f\"Supported entities for {lang}:\")\n",
    "    print(\"\\n\")\n",
    "    pprint(analyzer_engine.get_supported_entities(lang), compact=True)\n",
    "    \n",
    "    print(f\"\\nLoaded recognizers for {lang}:\")\n",
    "    pprint([rec.name for rec in analyzer_engine.registry.get_recognizers(lang, all_fields=True)], compact=True)\n",
    "    print(\"\\n\")\n",
    "   \n",
    "print(f\"\\nLoaded NER models:\")\n",
    "pprint(analyzer_engine.nlp_engine.models)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6a50606-a68a-4984-995b-36c9ccdccb81",
   "metadata": {},
   "outputs": [],
   "source": [
    "es_text = \"Hola, me llamo David Johnson y soy originalmente de Liverpool. Mi número de tarjeta de crédito es 4095260993934932\"\n",
    "analyzer_engine.analyze(es_text, language=\"es\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea24b036-c65f-4376-b545-76089e7dddef",
   "metadata": {},
   "outputs": [],
   "source": [
    "en_text = \"Hi, my name is David Johnson and I'm originally from Liverpool. My credit card number is 4095260993934932\"\n",
    "analyzer_engine.analyze(en_text, language=\"en\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "681d2542-0f22-4841-9217-a64f72a12c84",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "presidio_e2e",
   "language": "python",
   "name": "presidio"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
